To work with Arango in Python, we simply need the python-arango package (pip install python-arango) and to import arango.
# We'll import networkx and matplotlib for some light visualizations
import json
import networkx as nx
import matplotlib.pyplot as plt
from arango import ArangoClient
Like almost every other database we work with, we first need to connect to the server. With Arango, we use arango.ArangoClient.
We simply need to provide the server's host URL (the hosts argument).
Once connected, we can access a given database by passing its name and our credentials to client.db()
Finally we can also connect to a graph, by name, by using db.graph()
# Our client connection
client = ArangoClient(hosts='http://18.219.151.47:8529')
# Our database connection
db = client.db('emse6586', username='root', password='emse6586pass')
Once connected to a database, we can query the database using AQL, simply by executing it.
db.aql.execute(AQL_query)
query = """FOR tweet IN statuses
LIMIT 10
RETURN tweet"""
results = db.aql.execute(query)
print(results)
Like other connections, we get back cursor objects from our executed queries. We can access these exactly the same way we interact with other cursors.
tweets = list(results)
print(tweets[0])
Given the dictionary-like structure of the objects, they can be easily loaded into a DataFrame
import pandas as pd
df = pd.DataFrame(tweets)
df.head()
Given that Arango is built around traversing graphs and graph-like queries, python-arango provides a streamlined API for traversals. However, using the simplified API requires a graph to have been defined within Arango.
If you recall from the lecture, we have one called twitter_sphere which we can hook into. The graph can be initialized by db.graph({graph_name})
# Our graph connection
graph = db.graph('twitter_sphere')
results = graph.traverse(max_depth=2, direction='any', start_vertex='users/22203756', vertex_uniqueness='global')
print(results.keys())
The main difference with this result set is the format of the data structure. Given that it is a graph traversal, the result can contain three different subelements: vertices, paths, and edges.
Each of these elements has its own structure that will mirror those we saw when reviewing graph traversals.
print(f'Number of paths: {len(results["paths"])}')
print(f'Number of vertices: {len(results["vertices"])}')
print(json.dumps(results['paths'][4000], indent=2))
for edge in results['paths'][4000]['edges']:
    print(edge)
Traversals return vertices, paths, and edges (depending on the traversal)
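To make that structure concrete, here is a minimal sketch that walks a hand-written result shaped like the traversal output above. The sample documents, the `follows` edge collection name, and the helper `edge_pairs` are made up for illustration; only the `paths`/`vertices`/`edges` layout and the `_id`, `_from`, and `_to` keys come from the cells in this notebook.

```python
# Hand-written stand-in for a traversal result (illustrative data only).
alice = {"_id": "users/1", "screen_name": "alice"}
bob = {"_id": "users/2", "screen_name": "bob"}
carol = {"_id": "users/3", "screen_name": "carol"}
edge_ab = {"_id": "follows/10", "_from": "users/1", "_to": "users/2"}
edge_bc = {"_id": "follows/11", "_from": "users/2", "_to": "users/3"}

sample_results = {
    "vertices": [alice, bob, carol],
    "paths": [
        # Each path lists the vertices visited and the edges walked.
        {"vertices": [alice], "edges": []},
        {"vertices": [alice, bob], "edges": [edge_ab]},
        {"vertices": [alice, bob, carol], "edges": [edge_ab, edge_bc]},
    ],
}


def edge_pairs(results):
    """Return the distinct (_from, _to) pairs found across all paths."""
    seen = set()
    for path in results["paths"]:
        for edge in path["edges"]:
            seen.add((edge["_from"], edge["_to"]))
    return sorted(seen)


print(edge_pairs(sample_results))
```

Note that the same edge appears in more than one path, which is why the helper deduplicates before returning.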
Using networkx and matplotlib, we can actually plot some of these graphs
def populate_from_query(results, G, limit=100):
    """Given results from a query, populate a networkx graph.

    Args:
        results (dict) - Traversal results containing a 'paths' list
        G (networkx.Graph) - A networkx graph
        limit (int) - Limit to number of edges to populate (-1 for no limit)
    """
    edge_count = 0
    for result in results['paths']:
        nodes = result['vertices']
        edges = result['edges']
        for edge in edges:
            if edge_count % 100 == 0:
                print(f'{edge_count} edges processed')
            from_user = edge['_from']
            to_user = edge['_to']
            # Fall back to the raw _id if no readable label is found
            from_node, to_node = from_user, to_user
            for node in nodes:
                if node['_id'] == from_user:
                    if 'screen_name' in node:
                        from_node = node['screen_name']
                    elif 'status_id' in node:
                        from_node = node['status_id']
                if node['_id'] == to_user:
                    if 'screen_name' in node:
                        to_node = node['screen_name']
                    elif 'status_id' in node:
                        to_node = node['status_id']
            G.add_edge(from_node, to_node)
            edge_count += 1
            if edge_count > limit and limit != -1:
                return
fig, ax = plt.subplots(1, 1, figsize=(16, 14))
G = nx.Graph()
populate_from_query(results, G, 25)
pos = nx.spring_layout(G, k=.01)
# Pass the axes to the draw calls (not to nx.Graph) so the plot lands on our figure
nx.draw_networkx_nodes(G, pos, ax=ax, node_color='red', alpha=0.7, node_size=500)
nx.draw_networkx_edges(G, pos, ax=ax, edge_color='gray', alpha=0.5)
nx.draw_networkx_labels(G, pos, ax=ax, font_weight='bold', font_size=12, font_color='black')
#nx.draw(G, pos, font_size=16, with_labels=True)
for p in pos:  # raise text positions
    pos[p][1] += 0.07
Let's apply this kind of logic to solve a more interesting, and complex, problem.
In our twitter data we'll focus on two people, and the relationships that connect them together:
To start we need to write our query to identify the paths that link our two people together.
We'll focus on just friendships and only allow our traversal to search a depth of 2. Meaning that we will only allow one intermediary friend to link our two people together.
# Space for the query
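One possible shape for this query, offered as a sketch rather than the intended answer: the helper below builds an AQL traversal string plus bind variables. The helper name `build_friend_path_query`, the edge collection name `friends`, the `ANY` direction, and the placeholder user ids are all assumptions; substitute the collection and `_id` values from your own database.

```python
def build_friend_path_query(start_id, end_id, max_depth=2, edge_collection="friends"):
    """Build an AQL traversal that returns paths from start_id to end_id.

    edge_collection is an assumed name for the friendship edges.
    """
    if not isinstance(max_depth, int) or max_depth < 1:
        raise ValueError("max_depth must be a positive integer")
    # Depth is validated above and formatted in; everything else is bound.
    query = (
        f"FOR v, e, p IN 1..{max_depth} ANY @start @@edges\n"
        "    FILTER v._id == @end\n"
        "    RETURN p"
    )
    bind_vars = {"start": start_id, "end": end_id, "@edges": edge_collection}
    return query, bind_vars


# Placeholder ids; replace with the two users' real _id values.
query, bind_vars = build_friend_path_query("users/111", "users/222")
print(query)
print(bind_vars)

# With a live connection, the paths could then be fetched with:
# paths = list(db.aql.execute(query, bind_vars=bind_vars))
```

Each returned path carries its own `vertices` and `edges` lists, the same layout the traversal results used earlier. Raising `max_depth` is the only change needed to allow more intermediaries.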
Because our data treats friendship as a single direction, we can end up with paths that touch the same vertices (while technically being separate paths). So let's clean it up to see how many actual intermediate people connect our two users.
Identify the unique vertices for the query:
# Space to identify unique connections
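As one way to approach this step, the sketch below collects the distinct intermediate vertices from a list of paths. The helper name `unique_intermediaries` and the sample paths are hypothetical; it assumes each path is a dictionary with a `vertices` list whose items carry an `_id`, as in the traversal results above.

```python
def unique_intermediaries(paths, start_id, end_id):
    """Collect the distinct vertex _ids that sit between start and end."""
    found = set()
    for path in paths:
        for vertex in path["vertices"]:
            if vertex["_id"] not in (start_id, end_id):
                found.add(vertex["_id"])
    return found


# Illustrative paths: two of them pass through the same intermediary.
sample_paths = [
    {"vertices": [{"_id": "users/1"}, {"_id": "users/2"}, {"_id": "users/9"}]},
    {"vertices": [{"_id": "users/1"}, {"_id": "users/2"}, {"_id": "users/9"}]},
    {"vertices": [{"_id": "users/1"}, {"_id": "users/3"}, {"_id": "users/9"}]},
]

middle = unique_intermediaries(sample_paths, "users/1", "users/9")
print(len(middle), sorted(middle))
```

Using a set means that paths which differ only in the edges they follow still count each intermediate person once.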
Now let's plot the data to see if our findings are corroborated by the graph's visualization. We can hook into the previously created populate_from_query function; however, you will need to modify the results for it to work.
# Space for our graph
fig, ax = plt.subplots(1, 1, figsize=(14, 8))
G = nx.Graph()
populate_from_query({'paths': paths}, G, 150)
pos = nx.spring_layout(G, k=.3)
# Pass the axes to the draw calls (not to nx.Graph) so the plot lands on our figure
nx.draw_networkx_nodes(G, pos, ax=ax, node_color='red', alpha=0.7, node_size=500)
nx.draw_networkx_edges(G, pos, ax=ax, edge_color='gray', alpha=0.5)
nx.draw_networkx_labels(G, pos, ax=ax, font_weight='bold', font_size=12, font_color='black')
#nx.draw(G, pos, font_size=16, with_labels=True)
for p in pos:  # raise text positions
    pos[p][1] += 0.07
How would we do this same analysis using SQL?
Do you think that this would be harder or easier?
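For comparison, here is a rough sketch of the one-intermediary search in SQL, run through Python's built-in sqlite3 module. It assumes a hypothetical `friends` table with `friend_from` and `friend_to` columns (the names used in the SQL fragment in this section) and a handful of made-up rows; the real data may be laid out differently.

```python
import sqlite3

# In-memory database with an assumed friends(friend_from, friend_to) table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE friends (friend_from TEXT, friend_to TEXT)")
conn.executemany(
    "INSERT INTO friends VALUES (?, ?)",
    [("a", "x"), ("x", "b"), ("a", "y"), ("y", "b"), ("a", "z")],
)

# A self-join finds every person who is followed by the first user
# and who in turn follows the second user.
one_hop_sql = """
SELECT DISTINCT f1.friend_to AS intermediary
FROM friends AS f1
JOIN friends AS f2 ON f1.friend_to = f2.friend_from
WHERE f1.friend_from = ? AND f2.friend_to = ?
ORDER BY intermediary
"""
intermediaries = [row[0] for row in conn.execute(one_hop_sql, ("a", "b"))]
print(intermediaries)
```

In this style, each additional intermediary requires another join on the same table, so the query itself grows with the depth of the search.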
Looking at only one jump wasn't too complicated when moving back and forth between SQL and Arango (graphs), but how about expanding to allow two jumps?
Change our AQL to search for the connections between Trump and The Rock with 1 to 2 intermediary friends:
# Updated query
How does changing AQL compare to what we would have to do to change the SQL query?
-- Handles one intermediary friend
SELECT * FROM friends
WHERE friend_from = 'tim_cook'
AND friend_to IN (SELECT friend_from FROM friends WHERE friend_to='elonmusk') as inter_1
OR friend_to IN (SELECT friend_from FROM inter_1) as inter_2
OR